The business model is a platform (crowd investing) where people who have a business idea, but do not have money they need, can register and raise funds for their project within a given time.
On the other hand, there are donors (creditors or investors) who would like to invest their money in projects and are looking for investments' opportunities.
The platform brings together borrowers and lenders as an intermediary.
Database contains historic data of the platform.
Additional assumptions
All projects are completed, i.e. the time to raise money for each project has expired. In the business model all the collected funds are paid out, even if the target amount has not been reached.
The platform earns money as a commission for each project that is placed with the help of the platform.
The lender receives an interest on the loan.
Description of data fields:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
# Data first reading to understand what separator is used
df_input = pd.read_csv("data_abschlussprojekt.csv", engine='python', nrows=2)
df_input
# Separator is # and the first column is index
| # funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval | |
|---|---|
| 0#300.0#300.0#Fruits & Vegetables#Food#To buy seasonal | fresh fruits to sell. #PK#Pakistan#Lahore#PKR... |
| 1#575.0#575.0#Rickshaw#Transportation#to repair and maintain the auto rickshaw used in their business.#PK#Pakistan#Lahore#PKR#11.0#14#female | female#irregular |
# Reading with correct separator and delete index column
df_input = pd.read_csv("data_abschlussprojekt.csv", sep='#', index_col = 0, engine='python')
df_input.head(2)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
General Information about Data
df_input.shape
# data has 13 columns and 671205 rows
(671205, 13)
df_input.info()
# types of data, number of not-null data in each column, number of columns of each type, memory usage
<class 'pandas.core.frame.DataFrame'> Int64Index: 671205 entries, 0 to 671204 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 671205 non-null float64 1 loan_amount 671205 non-null float64 2 activity 671205 non-null object 3 sector 671205 non-null object 4 use 666973 non-null object 5 country_code 671197 non-null object 6 country 671205 non-null object 7 region 614405 non-null object 8 currency 671205 non-null object 9 term_in_months 671205 non-null float64 10 lender_count 671205 non-null int64 11 borrower_genders 666984 non-null object 12 repayment_interval 671205 non-null object dtypes: float64(3), int64(1), object(9) memory usage: 71.7+ MB
Evaluation
Numeric data have numeric-type, string data have object-type
General statistical evaluation of data
df_input.describe()
| funded_amount | loan_amount | term_in_months | lender_count | |
|---|---|---|---|---|
| count | 671205.000000 | 671205.000000 | 671205.000000 | 671205.000000 |
| mean | 785.995061 | 842.397107 | 13.739022 | 20.590922 |
| std | 1130.398941 | 1198.660073 | 8.598919 | 28.459551 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 |
| 25% | 250.000000 | 275.000000 | 8.000000 | 7.000000 |
| 50% | 450.000000 | 500.000000 | 13.000000 | 13.000000 |
| 75% | 900.000000 | 1000.000000 | 14.000000 | 24.000000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 |
Evaluation results
Min -values seems to be correct, Max -values are located too far from median as well as from average, it is necessary to look carefully at the transactions with such a loan size, term and number of lenders
Checking and Correcting Column Names, Values, Data Formatting
df_input.columns
#column names have spaces at the beginning. Deleting with lstrip
df_input.columns= df_input.columns.str.lstrip()
df_input.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
# get the unique values for each column with 'object'-type to check the data
object_liste = df_input.select_dtypes(include='object').columns
for element in object_liste:
print(element,'\n', df_input[element].unique())
activity ['Fruits & Vegetables' 'Rickshaw' 'Transportation' 'Embroidery' 'Milk Sales' 'Services' 'Dairy' 'Beauty Salon' 'Manufacturing' 'Food Production/Sales' 'Wholesale' 'General Store' 'Clothing Sales' 'Poultry' 'Tailoring' 'Sewing' 'Bakery' 'Restaurant' 'Food Stall' 'Farming' 'Construction Supplies' 'Personal Products Sales' 'Home Products Sales' 'Natural Medicines' 'Fish Selling' 'Education provider' 'Shoe Sales' 'Machinery Rental' 'Butcher Shop' 'Pigs' 'Personal Expenses' 'Food Market' 'Cosmetics Sales' 'Personal Housing Expenses' 'Retail' 'Energy' 'Grocery Store' 'Construction' 'Agriculture' 'Motorcycle Transport' 'Charcoal Sales' 'Food' 'Pharmacy' 'Fishing' 'Timber Sales' 'Cattle' 'Electronics Repair' 'Electronics Sales' 'Vehicle' 'Cafe' 'Blacksmith' 'Higher education costs' 'Used Clothing' 'Fuel/Firewood' 'Upholstery' 'Catering' 'Animal Sales' 'Cereals' 'Vehicle Repairs' 'Arts' 'Cloth & Dressmaking Supplies' 'Mobile Phones' 'Spare Parts' 'Clothing' 'Metal Shop' 'Barber Shop' 'Furniture Making' 'Crafts' 'Home Energy' 'Home Appliances' 'Wedding Expenses' 'Taxi' 'Secretarial Services' 'Livestock' 'Property' 'Recycling' 'Farm Supplies' 'Auto Repair' 'Beverages' 'Plastics Sales' 'Electrical Goods' 'Carpentry' 'Photography' 'Jewelry' 'Bricks' 'Pub' 'Phone Use Sales' 'Water Distribution' 'Paper Sales' 'Computers' 'Liquor Store / Off-License' 'Utilities' 'Knitting' 'Weaving' 'Party Supplies' 'Medical Clinic' 'Internet Cafe' 'Consumer Goods' 'Cement' 'Electrician' 'Primary/secondary school costs' 'Veterinary Sales' 'Land Rental' 'Laundry' 'Call Center' 'Perfumes' 'Hotel' 'Motorcycle Repair' 'Movie Tapes & DVDs' 'Quarrying' 'Personal Medical Expenses' 'Bookstore' 'Decorations Sales' 'Recycled Materials' 'Office Supplies' 'Souvenir Sales' 'Renewable Energy Products' 'Health' 'Printing' 'Phone Repair' 'Traveling Sales' 'Flowers' 'Bicycle Repair' 'Entertainment' 'Phone Accessories' 'Hardware' 'Used Shoes' 'Music Discs & Tapes' 'Games' 'Balut-Making' 'Textiles' 'Child Care' 'Goods Distribution' 'Florist' 'Cobbler' 'Dental' 'Bookbinding' 'Cheese Making' 'Bicycle Sales' 'Well digging' 'Technology' 'Musical Performance' 'Waste Management' 'Film' 'Tourism' 'Musical Instruments' 'Religious Articles' 'Machine Shop' 'Cleaning Services' 'Sporting Good Sales' 'Patchwork' 'Funerals' 'Air Conditioning' 'Communications' 'Adult Care' 'Landscaping / Gardening' 'Aquaculture' 'Beekeeping' 'Event Planning' 'Celebrations' 'Computer' 'Personal Care Products' 'Mobile Transactions'] sector ['Food' 'Transportation' 'Arts' 'Services' 'Agriculture' 'Manufacturing' 'Wholesale' 'Retail' 'Clothing' 'Construction' 'Health' 'Education' 'Personal Use' 'Housing' 'Entertainment'] use ['To buy seasonal, fresh fruits to sell. ' 'to repair and maintain the auto rickshaw used in their business.' 'To repair their old cycle-van and buy another one to rent out as a source of income' ... 'Pretend the issue with loan got addressed by Kiva Coordinator.' 'Kiva Coordinator replaced loan use. Should see this in viewdiff.' 'Edited loan use in english.'] country_code ['PK' 'IN' 'KE' 'NI' 'SV' 'TZ' 'PH' 'PE' 'SN' 'KH' 'LR' 'VN' 'IQ' 'HN' 'PS' 'MN' 'US' 'ML' 'CO' 'TJ' 'GT' 'EC' 'BO' 'YE' 'GH' 'SL' 'HT' 'CL' 'JO' 'UG' 'BI' 'BF' 'TL' 'ID' 'GE' 'UA' 'XK' 'AL' 'CD' 'CR' 'SO' 'ZW' 'CM' 'TR' 'AZ' 'DO' 'BR' 'MX' 'KG' 'AM' 'PY' 'LB' 'WS' 'IL' 'RW' 'ZM' 'NP' 'CG' 'MZ' 'ZA' 'TG' 'BJ' 'BZ' 'SR' 'TH' 'NG' 'MR' 'VU' 'PA' 'VI' 'VC' 'LA' 'MW' 'MM' 'MD' 'SS' 'SB' 'CN' 'EG' 'GU' 'AF' 'MG' nan 'PR' 'LS' 'CI' 'BT'] country ['Pakistan' 'India' 'Kenya' 'Nicaragua' 'El Salvador' 'Tanzania' 'Philippines' 'Peru' 'Senegal' 'Cambodia' 'Liberia' 'Vietnam' 'Iraq' 'Honduras' 'Palestine' 'Mongolia' 'United States' 'Mali' 'Colombia' 'Tajikistan' 'Guatemala' 'Ecuador' 'Bolivia' 'Yemen' 'Ghana' 'Sierra Leone' 'Haiti' 'Chile' 'Jordan' 'Uganda' 'Burundi' 'Burkina Faso' 'Timor-Leste' 'Indonesia' 'Georgia' 'Ukraine' 'Kosovo' 'Albania' 'The Democratic Republic of the Congo' 'Costa Rica' 'Somalia' 'Zimbabwe' 'Cameroon' 'Turkey' 'Azerbaijan' 'Dominican Republic' 'Brazil' 'Mexico' 'Kyrgyzstan' 'Armenia' 'Paraguay' 'Lebanon' 'Samoa' 'Israel' 'Rwanda' 'Zambia' 'Nepal' 'Congo' 'Mozambique' 'South Africa' 'Togo' 'Benin' 'Belize' 'Suriname' 'Thailand' 'Nigeria' 'Mauritania' 'Vanuatu' 'Panama' 'Virgin Islands' 'Saint Vincent and the Grenadines' "Lao People's Democratic Republic" 'Malawi' 'Myanmar (Burma)' 'Moldova' 'South Sudan' 'Solomon Islands' 'China' 'Egypt' 'Guam' 'Afghanistan' 'Madagascar' 'Namibia' 'Puerto Rico' 'Lesotho' "Cote D'Ivoire" 'Bhutan'] region ['Lahore' 'Maynaguri' 'Abdul Hakeem' ... 'Gbenikoro Village' 'Morimaraia' 'alejandria'] currency ['PKR' 'INR' 'KES' 'NIO' 'USD' 'TZS' 'PHP' 'PEN' 'XOF' 'LRD' 'VND' 'HNL' 'MNT' 'COP' 'GTQ' 'TJS' 'BOB' 'YER' 'KHR' 'GHS' 'SLL' 'HTG' 'CLP' 'JOD' 'UGX' 'BIF' 'IDR' 'GEL' 'UAH' 'EUR' 'ALL' 'CRC' 'XAF' 'TRY' 'AZN' 'DOP' 'BRL' 'MXN' 'KGS' 'AMD' 'PYG' 'LBP' 'WST' 'ILS' 'RWF' 'ZMW' 'NPR' 'MZN' 'ZAR' 'BZD' 'SRD' 'NGN' 'VUV' 'XCD' 'MWK' 'LAK' 'MMK' 'ZWD' 'MDL' 'SSP' 'SBD' 'CNY' 'EGP' 'MGA' 'NAD' 'LSL' 'THB'] borrower_genders ['female' 'female, female' 'female, female, female' ... 'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female' 'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male' 'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'] repayment_interval ['irregular' 'bullet' 'monthly' 'weekly']
Evaluation
At first look there are no explicit missing values. Column 'use' has extra spaces to be deleted.
# Deleting of extra spaces
df_input['use'].str.strip()
0 To buy seasonal, fresh fruits to sell.
1 to repair and maintain the auto rickshaw used ...
2 To repair their old cycle-van and buy another ...
3 to purchase an embroidery machine and a variet...
4 to purchase one buffalo.
...
671200 [True, u'para compara: cemento, arenya y ladri...
671201 [True, u'to start a turducken farm.'] - this l...
671202 NaN
671203 [True, u'to start a turducken farm.'] - this l...
671204 [True, u'to start a turducken farm.'] - this l...
Name: use, Length: 671205, dtype: object
df_input.head(5)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
df_duplicates = df_input.loc[df_input.duplicated(),:]
df_duplicates
# checking some data carefully
df_duplicates.query("activity == 'Home Energy' and sector == 'Personal Use' and country == 'El Salvador'")
df_duplicates.query("sector == 'Education' and country=='Kenya' and activity == 'Higher education costs' and lender_count == 18")
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 363782 | 500.0 | 500.0 | Higher education costs | Education | to pay for university tuition | KE | Kenya | Nairobi | KES | 31.0 | 18 | female | monthly |
| 374737 | 500.0 | 500.0 | Higher education costs | Education | to pay for university fees | KE | Kenya | Nairobi | KES | 32.0 | 18 | female | monthly |
| 387675 | 500.0 | 500.0 | Higher education costs | Education | to pay for university fees | KE | Kenya | Nairobi | KES | 40.0 | 18 | female | monthly |
| 387899 | 500.0 | 500.0 | Higher education costs | Education | to pay for university fees | KE | Kenya | Nairobi | KES | 40.0 | 18 | female | monthly |
| 387977 | 500.0 | 500.0 | Higher education costs | Education | to pay for university fees. | KE | Kenya | Nairobi | KES | 40.0 | 18 | male | monthly |
| 408822 | 500.0 | 500.0 | Higher education costs | Education | to pay for university fees | KE | Kenya | Nairobi | KES | 40.0 | 18 | male | monthly |
| 574734 | 500.0 | 500.0 | Higher education costs | Education | to pay for university fees. | KE | Kenya | Nairobi | KES | 40.0 | 18 | male | monthly |
Evaluation of duplicates
The data has 24372 duplicated rows (3,6% of all the data). The data does not contain information on unique code of each project and unique launch date of the project, which could tell us whether we are dealing with duplicates or similar deals. Since we cannot explicitly define whether these are duplicates or identical projects committed by different people with different start time, then by deleting them we may lose a useful information.
Deсision: not to delete this data
df_input["borrower_genders"].unique()
# Column borrower_genders includes not only gender values, but also lists of groups of people with different
# or similar gender
array(['female', 'female, female', 'female, female, female', ...,
'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female',
'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male',
'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'],
dtype=object)
# checking data carefully
df_input.query("borrower_genders == 'female, female, female'")
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 875.0 | 875.0 | Tailoring | Services | To buy a sewing machine, lace, zippers and but... | PK | Pakistan | Lahore | PKR | 11.0 | 25 | female, female, female | irregular |
| 207 | 800.0 | 800.0 | Clothing Sales | Clothing | to invest in working capital to stock their bu... | NI | Nicaragua | Leon | NIO | 9.0 | 7 | female, female, female | monthly |
| 379 | 2650.0 | 2650.0 | Clothing Sales | Clothing | to buy new ladies' clothes and shoes. | TZ | Tanzania | Dar es Salaam | TZS | 10.0 | 67 | female, female, female | irregular |
| 427 | 900.0 | 900.0 | Farming | Agriculture | to pay a plowing fee and a weeding fee and to ... | KH | Cambodia | Kampong Cham Province,Memot district | KHR | 13.0 | 32 | female, female, female | monthly |
| 515 | 675.0 | 675.0 | Home Products Sales | Retail | To buy more variety of bed sheets to sell. | PK | Pakistan | Rawalpindi | PKR | 12.0 | 23 | female, female, female | irregular |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 670632 | 225.0 | 2850.0 | Cosmetics Sales | Retail | to invest in her business by buying cosmetics,... | NI | Nicaragua | Managua | NIO | 14.0 | 5 | female, female, female | irregular |
| 670683 | 2750.0 | 2750.0 | Clothing Sales | Clothing | to buy clothes to sell. | NI | Nicaragua | Managua | NIO | 8.0 | 22 | female, female, female | irregular |
| 670957 | 0.0 | 1375.0 | Clothing Sales | Clothing | to buy and sell traditional clothing such as g... | GT | Guatemala | Suchitepéquez | GTQ | 14.0 | 0 | female, female, female | monthly |
| 670979 | 0.0 | 625.0 | Clothing Sales | Clothing | to buy and sell traditional clothing such as g... | GT | Guatemala | Suchitepéquez | GTQ | 14.0 | 0 | female, female, female | monthly |
| 671020 | 225.0 | 700.0 | Agriculture | Agriculture | to buy chickens to sell. | GT | Guatemala | Suchitepéquez | GTQ | 14.0 | 6 | female, female, female | monthly |
11676 rows × 13 columns
# Checking: do we have projects where funded_amount == 0 and one or more lender at the same time
# Checking: do we have projects where funded_amount > 0 and there are no lender at the same time
df_input.query('(funded_amount == 0 and lender_count > 0) or (funded_amount > 0 and lender_count == 0)').shape[0]
# We have no such projects
0
# check what fields have nan-values
df_input.isna().sum().to_frame(name='nan').query('nan > 0')
| nan | |
|---|---|
| use | 4232 |
| country_code | 8 |
| region | 56800 |
| borrower_genders | 4221 |
# calulating the number of unique values in 'use'-column
print(f"Number of unique values in 'use'-column: {len(df_input['use'].unique())}")
Number of unique values in 'use'-column: 424913
Evaluation
all the data that include nan-values have object-type:
idea#1: we are dealing with interval scale, try to change for the most common value for each sector
(with mode function)
idea#2: delete the data
- borrow_genders: by deleting we may lose the useful data --> choose idea#1
- use: the data contains comments to transactions. There is no sense in replacing with the most common value by
groups. The column contains 424913 unique values. The not nan-values can be left as comments or the column can
be deleted.
- region: the data provides an additional information about the region of the transaction within single country.
There is no point in replacing the nan-values with the mode-calculated value for each country. It may be far from
truth. The not nan-values can be left as an additional information or the whole column can be deleted.
# deleting columns 'region' and 'use' to save memory and speed up our analysis
df_input = df_input.drop(columns={'use','region'})
# checking for country_code
df_input.query("country_code.isna()")
df_input.query("country=='Namibia' or country_code=='NAM'")
| funded_amount | loan_amount | activity | sector | country_code | country | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | NaN | Namibia | NAD | 6.0 | 162 | female | bullet |
| 202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | NaN | Namibia | NAD | 6.0 | 159 | male | bullet |
| 344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | NaN | Namibia | NAD | 7.0 | 120 | female | bullet |
| 351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | NaN | Namibia | NAD | 7.0 | 126 | male | bullet |
| 420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | NaN | Namibia | NAD | 7.0 | 118 | female | bullet |
| 421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | NaN | Namibia | NAD | 7.0 | 150 | male | bullet |
| 487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | NaN | Namibia | NAD | 7.0 | 183 | male | bullet |
| 487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | NaN | Namibia | NAD | 7.0 | 183 | female | bullet |
# Namibia has ISO-Code NAM
# we have no rows with NAM-Code, but we have the projects from Namibia
# replacing nan-values for projects from Namibia with NAM
mask = (df_input["country_code"].isnull()) & (df_input["country"]=='Namibia')
df_input.loc[mask,"country_code"] = 'NAM'
# checking the result
df_input.loc[df_input["country"]=='Namibia',:]
| funded_amount | loan_amount | activity | sector | country_code | country | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | NAM | Namibia | NAD | 6.0 | 162 | female | bullet |
| 202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | NAM | Namibia | NAD | 6.0 | 159 | male | bullet |
| 344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | NAM | Namibia | NAD | 7.0 | 120 | female | bullet |
| 351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | NAM | Namibia | NAD | 7.0 | 126 | male | bullet |
| 420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | NAM | Namibia | NAD | 7.0 | 118 | female | bullet |
| 421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | NAM | Namibia | NAD | 7.0 | 150 | male | bullet |
| 487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | NAM | Namibia | NAD | 7.0 | 183 | male | bullet |
| 487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | NAM | Namibia | NAD | 7.0 | 183 | female | bullet |
# changing the values where borrower_genders has more than one gender to a 'group_of_persons'
df_input.groupby("borrower_genders").agg({"borrower_genders":'count'})
mask = df_input["borrower_genders"].str.contains(',')
df_input.loc[mask==True, "borrower_genders"] = "group_of_persons"
df_input["borrower_genders"].unique()
array(['female', 'group_of_persons', 'male', nan], dtype=object)
# look: what is the most common value for each sector?
df_grp = (df_input.groupby(by=["sector","borrower_genders"],as_index=False)
.agg(count = ("borrower_genders","count"))
.sort_values(by=["sector","count"],ascending=[True,False])
)
sector_liste = df_input.query('borrower_genders.isna()')['sector'].unique()
print(sector_liste,"\n")
for sector in sector_liste:
mode_val= df_input.loc[df_input['sector']==sector,"borrower_genders"].mode()[0]
print(f"{sector}: {mode_val}")
['Food' 'Personal Use' 'Services' 'Agriculture' 'Housing' 'Retail' 'Clothing' 'Arts' 'Manufacturing' 'Construction' 'Education' 'Entertainment' 'Health' 'Transportation' 'Wholesale'] Food: female Personal Use: group_of_persons Services: female Agriculture: female Housing: female Retail: female Clothing: female Arts: female Manufacturing: female Construction: male Education: female Entertainment: male Health: female Transportation: female Wholesale: female
# replacing the nan-values with the most common value for each sector
for sector in sector_liste:
mode_val= df_input.loc[df_input['sector']==sector,"borrower_genders"].mode()[0]
mask = (df_input["borrower_genders"].isnull()) & (df_input['sector']==sector)
df_input.loc[mask,"borrower_genders"] = mode_val
#check: do we still have nan-values in borrower_genders?
df_input.loc[df_input["borrower_genders"].isnull(),"borrower_genders"]
Series([], Name: borrower_genders, dtype: object)
# check: what nan-values do we have?
df_input.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 country_code 0 country 0 currency 0 term_in_months 0 lender_count 0 borrower_genders 0 repayment_interval 0 dtype: int64
# make a copy of dataset to compare further the result of optimization
df_opti=df_input.copy()
# checking the maximum float values
df_input.funded_amount.max(),df_input.funded_amount.max()
(100000.0, 100000.0)
# can be to float32 converted (max_value for float32: 3.4E+38, max_value for float16: 65536)
df_opti[["funded_amount","loan_amount"]] = df_opti[["funded_amount","loan_amount"]].astype("float32")
df_opti.dtypes
funded_amount float32 loan_amount float32 activity object sector object country_code object country object currency object term_in_months float64 lender_count int64 borrower_genders object repayment_interval object dtype: object
# checking for maximum values to make a decision about type depreciation
df_opti["term_in_months"].max(),df_opti["lender_count"].max()
(158.0, 2986)
# check: Does 'term_in_months' have not-whole values?
for i in df_opti["term_in_months"]:
if i%1 != 0:
print("Can't be to integer converted")
print("finished")
# can be to int16 converted (max_value for int16: 32767, max_value for int8: 127)
df_opti[["term_in_months","lender_count"]]= df_opti[["term_in_months","lender_count"]].astype("int16")
df_opti.dtypes
finished
funded_amount float32 loan_amount float32 activity object sector object country_code object country object currency object term_in_months int16 lender_count int16 borrower_genders object repayment_interval object dtype: object
#check: what data can be to category-type converted?
object_liste = df_opti.select_dtypes(include='object').columns
for element in object_liste:
print(element, len(df_opti[element].unique()))
activity 163 sector 15 country_code 87 country 87 currency 67 borrower_genders 3 repayment_interval 4
# change the chosen data to category type
objects_to_category = ['sector','country_code','country','currency','borrower_genders','repayment_interval','activity']
for element in objects_to_category:
df_opti[element] = df_opti[element].astype('category')
df_opti.dtypes
funded_amount float32 loan_amount float32 activity category sector category country_code category country category currency category term_in_months int16 lender_count int16 borrower_genders category repayment_interval category dtype: object
# creating a correlation matrix
list_of_numeric_columns =df_opti.select_dtypes(include='number').columns
df_corr = df_opti[list_of_numeric_columns].corr(method="pearson")
df_corr
| funded_amount | loan_amount | term_in_months | lender_count | |
|---|---|---|---|---|
| funded_amount | 1.000000 | 0.945044 | 0.149310 | 0.849168 |
| loan_amount | 0.945044 | 1.000000 | 0.184795 | 0.798697 |
| term_in_months | 0.149310 | 0.184795 | 1.000000 | 0.227283 |
| lender_count | 0.849168 | 0.798697 | 0.227283 | 1.000000 |
# building a plot for better understanding
fig_corr =px.imshow(df_corr, x =df_corr.columns,y=df_corr.index, text_auto='.2f')
fig_corr.update_layout(title_text = f"Correlation Matrix"
,title_font_size =20
,font_size=15
,title_x =0.5
)
fig_corr.show()
Descriptive part
The larger the number and the lighter the square, the greater the correlation between two columns (positive correlation). We can talk about the strong correlation when a number is greater than 0.7.
Analytical part
There is a strong positive correlation between the loan amount and the amount of money collected, which indicates that the larger the loan amount, the greater the amount collected. That tells us that the most of projects independent of requested amount were successfull in part or in full.
The strong positive correlation is between the loan amount and the number of creditors. The larger the loan amount, the greater the number of creditors of the project. That means that investors do not want to invest much in one project and divide risks by investing small amounts in different projects.
The correlation between loan term and funding is weak, that tells us the probability to raise funds for a project is weakly dependent on the loan term.
# Percent of projects received the requested amount in full to total
percent_of_full = (df_opti.query('funded_amount == loan_amount')['funded_amount'].count()/
df_opti['funded_amount'].count())*100
print(f"Percentage of Projects collected the full amount: {percent_of_full:.2f}%")
Percentage of Projects collected the full amount: 92.80%
df_opti.dtypes
funded_amount float32 loan_amount float32 activity category sector category country_code category country category currency category term_in_months int16 lender_count int16 borrower_genders category repayment_interval category dtype: object
# building a box-plot to look at the extreme funded amounts by sectors
fig_boxplot = px.box(df_opti
,x="sector"
,y="funded_amount"
,title="Box plot"
,hover_name='sector'
,hover_data={'sector':False}
)
fig_boxplot.add_hline(
y=50000,
line_width=2,
line_dash="dash",
line_color="red")
fig_boxplot.update_yaxes(title_text='Funded Amount, in USD')
fig_boxplot.update_xaxes(title_text="Sector")
fig_boxplot.update_layout(title_font_size =20
,font_size=16
,title_x =0.5
)
fig_boxplot.show()